Add DCP compatibility for FSDP2-TP sharding in TransformerEngine. #2713
cspades wants to merge 5 commits into NVIDIA:main from cspades:cye/fsdp2-tp-dcp
Conversation
force-pushed from 50da1dc to 925d022
Greptile Summary
This PR adds DCP (Distributed Checkpoint) compatibility for FSDP2+TP strided sharding across all `TransformerEngineBaseModule`(s). Key issues found:
Confidence Score: 2/5
Last reviewed commit: fcdd5bd
```diff
 if args.sharding_dims:
-    assert len(args.sharding_dims) <= 2
+    assert len(args.sharding_dims) <= 3
+if len(args.sharding_dims) >= 3:
+    # Set the TP size in args.
+    args.tp_size = args.sharding_dims[2]
+else:
+    args.tp_size = 1
 return args
```
args.sharding_dims not guarded against None
At line 153, `len(args.sharding_dims)` is called unconditionally, but `args.sharding_dims` can be `None` when the `--sharding-dims` flag is omitted (since the argument is not marked `required=True` and uses `nargs="+"`). This will raise `TypeError: object of type 'NoneType' has no len()`.
The `if len(args.sharding_dims) >= 3:` block should be nested inside the existing `if args.sharding_dims:` guard:
Suggested change:

```python
if args.sharding_dims:
    assert len(args.sharding_dims) <= 3
    if len(args.sharding_dims) >= 3:
        # Set the TP size in args.
        args.tp_size = args.sharding_dims[2]
    else:
        args.tp_size = 1
else:
    args.tp_size = 1
```
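The guarded logic can be factored into a small standalone helper for illustration; the function name `resolve_tp_size` is hypothetical and not part of the PR:

```python
def resolve_tp_size(sharding_dims):
    """Sketch of the guarded logic: tolerate sharding_dims=None and
    derive the TP size only when a third sharding dimension is given."""
    if sharding_dims:
        # At most three sharding dimensions; the third is the TP size.
        assert len(sharding_dims) <= 3
        if len(sharding_dims) >= 3:
            return sharding_dims[2]
    return 1

print(resolve_tp_size(None))        # 1
print(resolve_tp_size([8, 4]))      # 1
print(resolve_tp_size([8, 4, 2]))   # 2
```

With the guard in place, omitting `--sharding-dims` no longer raises a `TypeError`.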
```python
self.fc1_bias = _convert_param_to_dtensor_param(
    self.fc1_bias, tp_mesh, placements=(Shard(dim=0),)
)
# FC2 Weight -> Row-Parallel -> Shard(dim=1)
self.fc2_weight = _convert_param_to_dtensor_param(
    self.fc2_weight, tp_mesh, placements=(Shard(dim=1),)
)
# LN & FC2 Bias -> Replicate()
self.fc2_bias = _convert_param_to_dtensor_param(
    self.fc2_bias, tp_mesh, placements=(Replicate(),)
)
```
Bias converted unconditionally when use_bias=False
When `use_bias=False`, `self.fc1_bias` and `self.fc2_bias` are initialized as plain `torch.Tensor` objects (not `nn.Parameter`, see lines 1940 and 1958):

```python
else:
    self.fc1_bias = torch.Tensor().to(dtype=params_dtype, device=device)
```

Calling `_convert_param_to_dtensor_param` on them returns `nn.Parameter(DTensor.from_local(...))`. When this is then assigned back via `self.fc1_bias = new_param`, PyTorch's `Module.__setattr__` will detect the `nn.Parameter` type and register the bias as a named module parameter, even though biases are disabled. This would pollute `model.named_parameters()`, the optimizer parameter list, and checkpoint state.
The fix is to guard these two conversions behind `if self.use_bias:`, following the same pattern already used for `layer_norm_bias` at line 2091.
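The registration hazard can be reproduced without PyTorch; the sketch below mimics `nn.Module.__setattr__`'s parameter registration with stand-in classes (all names hypothetical):

```python
class Param:
    """Stand-in for nn.Parameter (illustration only)."""

class MiniModule:
    """Toy module mimicking how nn.Module.__setattr__ registers any
    Param-typed attribute into the module's parameter dict."""
    def __init__(self):
        object.__setattr__(self, "_parameters", {})

    def __setattr__(self, name, value):
        if isinstance(value, Param):
            # nn.Module does this for nn.Parameter assignments.
            self._parameters[name] = value
        else:
            object.__setattr__(self, name, value)

m = MiniModule()
m.fc1_bias = object()        # plain tensor-like object: not registered
print(list(m._parameters))   # []
m.fc1_bias = Param()         # converted to a Param: silently registered
print(list(m._parameters))   # ['fc1_bias']
```

This is why converting a disabled bias into an `nn.Parameter` leaks it into `named_parameters()` even though it was never meant to be trainable.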
```python
weight_mesh : Optional[DeviceMesh]
    Not used for DotProductAttention as there are no quantized weights.
"""
warnings.warn(f"weight_mesh not necessary for {self.__class__.__name__}: {weight_mesh}")
```
Spurious warning when weight_mesh is None
`warnings.warn(...)` is emitted unconditionally every time `set_device_mesh` is called, even when `weight_mesh=None`. The calling code invokes this method whenever `tp_mesh is not None or weight_mesh is not None`, so a normal call with only `tp_mesh` provided will generate a misleading warning like "weight_mesh not necessary for DotProductAttention: None".
The warning should only fire when the caller explicitly passes a non-None `weight_mesh`. The same spurious warning exists in `transformer_engine/pytorch/module/layernorm.py` (line 171) and `transformer_engine/pytorch/module/rmsnorm.py` (line 174).
```diff
-warnings.warn(f"weight_mesh not necessary for {self.__class__.__name__}: {weight_mesh}")
+if weight_mesh is not None:
+    warnings.warn(f"weight_mesh not necessary for {self.__class__.__name__}: {weight_mesh}")
```
Signed-off-by: Cory Ye <cye@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Cory Ye <cye@nvidia.com>
force-pushed from 4ec2947 to dbb9d14
Diff hunk `@@ -30,6 +38,61 @@`:

```python
LOCAL_RANK = None


@dataclass
class AppState(Stateful):
    """AppState for FSDP2 checkpoint via Torch DCP.

    Adapted from https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html
    """

    model: torch.nn.Module
    optimizer: torch.optim.Optimizer

    def state_dict(self):
        """
        Get the state dict for the model, optimizer, scheduler, and step.
        This factory both retrieves the model state dictionary when saving
        checkpoints and initializes a destination for the state read from
        DCP checkpoint files when loading checkpoints.
        """
        model_state_dict, optimizer_state_dict = get_state_dict(self.model, self.optimizer)
        for fqn in list(model_state_dict.keys()):
            # Get the model parameter.
            model_param = model_state_dict[fqn]
            if isinstance(model_param, DTensor):
                model_param = model_param.to_local()
            if model_param.numel() == 0 and fqn in optimizer_state_dict["state"]:
                # Empty model parameter. Clear the associated optimizer state
                # when initializing the optimizer state upon DCP load, because
                # empty optimizer state DTensors are not checkpointed with DCP,
                # yet get_state_dict / _init_optim_state produce empty Tensors.
                # TransformerEngine uses empty Tensors for dummy Parameters.
                optimizer_state_dict["state"][fqn] = {}
            if fqn.endswith("._extra_state"):
                # Evict `_extra_state` quantization data from model checkpoint.
                model_state_dict.pop(fqn)
        return {
            "model": model_state_dict,
            "optim": optimizer_state_dict,
        }

    def load_state_dict(self, state_dict: dict):
        """
        Load the state dict for the model, optimizer, scheduler, and step.
        Given the checkpoint-loaded state_dict, set the state of the model,
        optimizer, scheduler, step, and epoch to the values in state_dict.
        """
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"],
            # Non-strict checkpoint loading ignores empty optimizer states,
            # skips loading non-FP8 checkpoint weights (e.g. _extra_state).
            options=StateDictOptions(strict=False),
        )


def dist_print(msg):
    if LOCAL_RANK == 0:
```
DCP checkpoint functionality is not exercised in the test
The `AppState` class (lines 42–93) and the DCP checkpoint operations (save, load, `get_state_dict`, `set_state_dict`) are imported and fully implemented, but the training loop in `_train()` (lines 480–490) does not call any checkpoint save/load operations. The function ends at line 497 with `dist.destroy_process_group()` and no checkpoint round-trip.
Since the PR title is "Add DCP compatibility for FSDP2-TP sharding," the checkpoint functionality is the headline feature. Without an actual save/load call in the test, neither the `AppState.state_dict()` eviction logic nor the `set_state_dict(strict=False)` reload path is validated.
Recommendation: add a checkpoint save/load round-trip after the training loop (before `dist.destroy_process_group()`) to exercise the DCP functionality, or explicitly note in the test docstring that DCP round-trip testing is deferred to integration tests.
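The eviction logic in `AppState.state_dict()` can at least be exercised standalone; the sketch below replays it on plain dicts with a stand-in parameter type (all names here are hypothetical, not TransformerEngine APIs):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Stand-in for a (possibly empty) model parameter."""
    n: int
    def numel(self) -> int:
        return self.n

def clean_state_dicts(model_sd: dict, optim_sd: dict) -> dict:
    """Mirror of AppState.state_dict() post-processing: clear optimizer
    state for empty dummy parameters and evict `_extra_state` entries."""
    for fqn in list(model_sd.keys()):
        param = model_sd[fqn]
        if param.numel() == 0 and fqn in optim_sd["state"]:
            # Empty dummy parameter: drop its (unsaveable) optimizer state.
            optim_sd["state"][fqn] = {}
        if fqn.endswith("._extra_state"):
            # Quantization metadata is not checkpointed via DCP.
            model_sd.pop(fqn)
    return {"model": model_sd, "optim": optim_sd}

model_sd = {
    "fc1.weight": FakeParam(8),
    "fc1.bias": FakeParam(0),          # dummy empty parameter
    "fc1._extra_state": FakeParam(1),  # FP8 quantization metadata
}
optim_sd = {"state": {
    "fc1.weight": {"exp_avg": FakeParam(8)},
    "fc1.bias": {"exp_avg": FakeParam(0)},
}}
out = clean_state_dicts(model_sd, optim_sd)
print(sorted(out["model"]))               # ['fc1.bias', 'fc1.weight']
print(out["optim"]["state"]["fc1.bias"])  # {}
```

A full DCP round-trip test would still be needed to validate the `set_state_dict(strict=False)` reload path on real `DTensor` shards.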
Working on it! + GroupedLinear test case.
Signed-off-by: Cory Ye <cye@nvidia.com>
```python
CKPT_DIR = (
    Path(SHARED_TMP_DIR)
    / "run_fsdp2_model"
    / f"dcp-{'_'.join(str(x) for x in args.sharding_dims)}-{args.layer_type}-{args.recipe}-fp8_init_{args.fp8_init}"
)
```
`args.sharding_dims` can be `None` when `--sharding-dims` is not passed (the argument uses `nargs="+"` without `required=True`). The f-string on this line iterates over it directly, which will raise `TypeError: 'NoneType' object is not iterable` in that case.
```diff
-    / f"dcp-{'_'.join(str(x) for x in args.sharding_dims)}-{args.layer_type}-{args.recipe}-fp8_init_{args.fp8_init}"
+    / f"dcp-{'_'.join(str(x) for x in (args.sharding_dims or []))}-{args.layer_type}-{args.recipe}-fp8_init_{args.fp8_init}"
```
```python
grouped_param = _convert_param_to_dtensor_param(
    grouped_param,
    device_mesh=dtensor_member_param.device_mesh,
    placements=dtensor_member_param.placements,
    # DTensor / DCP will view this as a TP-sharded 3-D Tensor.
    shape=(self.num_gemms, self.out_features, self.in_features),
    # Default Stride: (out*in, in, 1)
    stride=None,
)
```
`dtensor_member_param.placements` was assigned in `set_device_mesh` relative to the 2-D weight shape `(out_features, in_features)`:

- Column-parallel → `Shard(dim=0)` (shard over `out_features`)
- Row-parallel → `Shard(dim=1)` (shard over `in_features`)

But the global shape here is the 3-D tensor `(num_gemms, out_features, in_features)`. Reusing the same placements verbatim means:

- `Shard(dim=0)` now refers to the `num_gemms` axis, which is wrong: each TP rank holds all gemms.
- `Shard(dim=1)` would refer to `out_features` when the row-parallel split is actually on `in_features` (dim=2).

As a result, when DCP reconstructs the full checkpoint from the local shards it will use the wrong axis, producing silently corrupted weight tensors.
The shard dimensions need to be incremented by 1 to account for the prepended `num_gemms` dimension:

```python
from torch.distributed.tensor import Shard as _Shard, Replicate as _Replicate

adjusted_placements = tuple(
    _Shard(p.dim + 1) if isinstance(p, _Shard) else p
    for p in dtensor_member_param.placements
)
grouped_param = _convert_param_to_dtensor_param(
    grouped_param,
    device_mesh=dtensor_member_param.device_mesh,
    placements=adjusted_placements,
    shape=(self.num_gemms, self.out_features, self.in_features),
    stride=None,
)
```

Okay, this is an impressive catch. I wrote this code too quickly. Will fix!
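The dim-shift rule can be checked in isolation; the sketch below uses stand-in placement classes rather than `torch.distributed.tensor` so the rule is visible without a distributed setup (the stand-in classes and helper name are hypothetical):

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Shard:
    """Minimal stand-in for torch.distributed.tensor.Shard."""
    dim: int

class Replicate:
    """Minimal stand-in for torch.distributed.tensor.Replicate."""

Placement = Union[Shard, Replicate]

def shift_for_prepended_dim(placements: Tuple[Placement, ...]) -> Tuple[Placement, ...]:
    """Shift every Shard dim by +1 when a leading num_gemms axis is
    prepended to the 2-D (out_features, in_features) weight shape;
    Replicate placements are unaffected."""
    return tuple(Shard(p.dim + 1) if isinstance(p, Shard) else p for p in placements)

# A row-parallel 2-D weight sharded on in_features (dim=1) must be
# sharded on dim=2 once viewed as (num_gemms, out_features, in_features).
print(shift_for_prepended_dim((Shard(1),)))  # (Shard(dim=2),)
```

Without this shift, DCP would gather shards along the wrong axis when reassembling the 3-D grouped weight.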
Summary
- DCP checkpoint compatibility for `DTensor` parameters with FP8, across all `TransformerEngineBaseModule`(s).

Details

- Previously, `"shard"` was the presumed weight-sharding sub-mesh in the `DTensor.device_mesh`. Now, users can precisely specify their own custom weight-sharding `DeviceMesh` for per-tensor `amax_reduction_group` via the `set_device_mesh` API.

Testing

- Fails on both `main` and `cspades:cye/fsdp2-tp-dcp`, so we can assume it is not associated to my change: https://github.com/NVIDIA/Megatron-LM/actions/runs/22637904520/job/65636890955?pr=3661 (TransformerEngine `main`)
- `main` vs. `cspades:cye/fsdp2-tp-dcp` with Megatron-LM `main` on PyTorch 25.11

Type of change
Changes
Please list the changes introduced in this PR:
Checklist: